library(readxl)   # read_excel()
library(caret)    # createDataPartition(), train()
library(ggplot2)  # ggplot()
library(plotly)   # ggplotly()

Concrete_Data <- read_excel("Concrete_Data.xls")
colnames(Concrete_Data) <- c("Cement", "Blast Furnace Slag", "Fly Ash",
"Water", "Superplasticizer", "Coarse Aggregate",
"Fine Aggregate", "Age", "Concrete Compressive Strength")
sum(is.na(Concrete_Data))
## [1] 0
We loaded the data into R from the Excel file it was originally stored in. The column names were all too long, so we changed them to make them more friendly to use. We then checked to see if there were any missing values in the dataset. It makes sense that there weren’t any as this data was taken from the UCI Machine Learning Repository and was already cleaned.
set.seed(123)
trainingSetIndex <- createDataPartition(Concrete_Data$`Concrete Compressive Strength`, p = 0.75, list = FALSE)
trainData <- Concrete_Data[trainingSetIndex, ]
testData <- Concrete_Data[-trainingSetIndex, ]
Before we could start building the model, we had to randomly partition the data into two separate subsets. The first partition contained 75% of the data and was used to train the model; the second contained the remaining 25% and was used to test the accuracy of our predictive model.
mod3 <- train(`Concrete Compressive Strength` ~ ., data = trainData,
              method = "lm", preProcess = c("scale", "center"),
              trControl = trainControl(method = "none"))
mod3_training <- predict(mod3, trainData)
mod3_testing <- predict(mod3, testData)
We wanted to see how the model would perform before doing any transformations or further analysis, so we used the training data to fit a linear model regressing concrete compressive strength on all of the independent variables. The data were preprocessed by centering and scaling the values. We then used the model to make predictions on both the training data and the testing data.
trainingDF_mod3 <- data.frame(trainData$`Concrete Compressive Strength`, mod3_training)
a = ggplot(data = trainingDF_mod3, aes(x = trainData..Concrete.Compressive.Strength.,
y = mod3_training)) +
geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") +
theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") +
xlab("True Training Values") + ylab("Training Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))
ggplotly(a)
Shown above is the plot of the actual concrete compressive
strength values in the training dataset (x-axis) versus the
concrete compressive strength values predicted by our model
(y-axis). If our model had predicted the values with 100%
accuracy, every point would fall exactly on the line y = x.
That is clearly not the case: as we move farther right on the
x-axis, the data spread out considerably.
testingDF_mod3 <- data.frame(testData$`Concrete Compressive Strength`, mod3_testing)
b = ggplot(data = testingDF_mod3, aes(x = testData..Concrete.Compressive.Strength.,
y = mod3_testing)) +
geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") +
theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") +
xlab("True Testing Values") + ylab("Testing Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))
ggplotly(b)
Shown above is the plot of the actual concrete compressive
strength values in the testing dataset (x-axis) versus the
concrete compressive strength values predicted by our model
(y-axis). We see a pattern here that is similar to the previous
plot. The variation in the data appears to increase as we move
farther right on the x-axis.
summary(mod3)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.544 -6.502 0.606 6.626 34.618
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           35.8901     0.3764  95.362  < 2e-16 ***
## Cement                11.9899     1.0381  11.550  < 2e-16 ***
## `Blast Furnace Slag`   8.2480     1.0142   8.133 1.69e-15 ***
## `Fly Ash`              5.0552     0.9465   5.341 1.22e-07 ***
## Water                 -3.5324     0.9781  -3.612 0.000324 ***
## Superplasticizer       1.9282     0.6500   2.967 0.003105 **
## `Coarse Aggregate`     1.0195     0.8509   1.198 0.231227
## `Fine Aggregate`       1.0628     0.9799   1.085 0.278437
## Age                    7.3198     0.4012  18.244  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 765 degrees of freedom
## Multiple R-squared: 0.6181, Adjusted R-squared: 0.6141
## F-statistic: 154.8 on 8 and 765 DF, p-value: < 2.2e-16
Looking at the summary, we see the model has an intercept of 35.8901 (because the predictors were centered and scaled, this is simply the mean compressive strength in the training data). Judging by the magnitudes of the standardized coefficients, two of the most influential variables are cement and blast furnace slag, whereas two of the least influential are coarse aggregate and fine aggregate. The model has a mediocre adjusted R-squared of 0.6141, which implies roughly 40% of the variation in the data cannot be explained by the model.
The residuals of the model appear to be roughly normally distributed and symmetric about 0 (the quartiles are nearly mirror images of each other), which suggests the model is not systematically biased.
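The residual check described above can be sketched as follows. This is our reconstruction (the report's original diagnostic code is not shown), using the basic model's training-set residuals:

```r
# Residuals = actual minus predicted on the training set
resid_mod3 <- trainData$`Concrete Compressive Strength` - mod3_training

# A histogram and a normal Q-Q plot give a quick visual check of
# normality and symmetry about 0
hist(resid_mod3, breaks = 30,
     main = "Residuals of the Basic Model", xlab = "Residual (MPa)")
qqnorm(resid_mod3)
qqline(resid_mod3)
```

A formal test such as `shapiro.test(resid_mod3)` could complement the visual check, though with several hundred observations the plots are usually more informative.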
## Cement Blast Furnace Slag Fly Ash
## Cement 1.00000000 -0.27519344 -0.397475440
## Blast Furnace Slag -0.27519344 1.00000000 -0.323569468
## Fly Ash -0.39747544 -0.32356947 1.000000000
## Water -0.08154361 0.10728594 -0.257043997
## Superplasticizer 0.09277137 0.04337574 0.377339559
## Coarse Aggregate -0.10935604 -0.28399823 -0.009976788
## Fine Aggregate -0.22272017 -0.28159326 0.079076351
## Age 0.08194726 -0.04424580 -0.154370165
## Concrete Compressive Strength 0.49783272 0.13482445 -0.105753348
## Water Superplasticizer
## Cement -0.08154361 0.09277137
## Blast Furnace Slag 0.10728594 0.04337574
## Fly Ash -0.25704400 0.37733956
## Water 1.00000000 -0.65746444
## Superplasticizer -0.65746444 1.00000000
## Coarse Aggregate -0.18231167 -0.26630276
## Fine Aggregate -0.45063498 0.22250149
## Age 0.27760443 -0.19271652
## Concrete Compressive Strength -0.28961348 0.36610230
## Coarse Aggregate Fine Aggregate Age
## Cement -0.109356039 -0.22272017 0.081947264
## Blast Furnace Slag -0.283998230 -0.28159326 -0.044245801
## Fly Ash -0.009976788 0.07907635 -0.154370165
## Water -0.182311668 -0.45063498 0.277604429
## Superplasticizer -0.266302755 0.22250149 -0.192716518
## Coarse Aggregate 1.000000000 -0.17850575 -0.003015507
## Fine Aggregate -0.178505755 1.00000000 -0.156094049
## Age -0.003015507 -0.15609405 1.000000000
## Concrete Compressive Strength -0.164927821 -0.16724896 0.328876976
## Concrete Compressive Strength
## Cement 0.4978327
## Blast Furnace Slag 0.1348244
## Fly Ash -0.1057533
## Water -0.2896135
## Superplasticizer 0.3661023
## Coarse Aggregate -0.1649278
## Fine Aggregate -0.1672490
## Age 0.3288770
## Concrete Compressive Strength 1.0000000
In an attempt to improve our predictions, we decided to look at the correlations between every pair of variables. We wanted to know whether any of the independent variables had a strong correlation with concrete compressive strength, and whether there was any interaction between pairs of independent variables. Upon plotting age versus concrete compressive strength, we noticed a logarithmic relationship between the two variables. We also noticed a moderately strong negative correlation between water and superplasticizer (-0.66) and between water and fine aggregate (-0.45).
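The exploratory plot of age versus strength described above could be sketched as follows (our reconstruction, not the report's original code):

```r
# Scatter plot of Age vs. strength; fitting a smooth of the form
# y ~ log(x) makes the logarithmic relationship visible
e <- ggplot(data = Concrete_Data,
            aes(x = Age, y = `Concrete Compressive Strength`)) +
  geom_point(alpha = 5/8) +
  geom_smooth(method = "lm", formula = y ~ log(x), se = FALSE, color = "red") +
  theme_bw() + ggtitle("Compressive Strength vs. Age")
e
```

If the logarithmic curve tracks the points well, that supports entering `log(Age)` rather than `Age` into the regression.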
mod4 <- train(`Concrete Compressive Strength` ~ Cement + `Blast Furnace Slag` +
                sqrt(`Fly Ash`) + Water + sqrt(Superplasticizer) +
                `Coarse Aggregate` + `Fine Aggregate` + log(Age) +
                Water*Superplasticizer + Water*`Fine Aggregate`,
              data = trainData, method = "lm",
              preProcess = c("scale", "center"),
              trControl = trainControl(method = "none"))
mod4_training <- predict(mod4, trainData)
mod4_testing <- predict(mod4, testData)
Based on the discoveries made about the relationships between some of the variables, we attempted to improve our model by taking the square root of fly ash and superplasticizer, taking the natural log of age, and adding two interaction terms (water with superplasticizer, and water with fine aggregate). We again used the training data to fit a linear model regressing concrete compressive strength on the transformed independent variables and interaction terms, preprocessed by centering and scaling. We then used the model to make predictions on both the training data and the testing data.
trainingDF_mod4 <- data.frame(trainData$`Concrete Compressive Strength`, mod4_training)
c = ggplot(data = trainingDF_mod4, aes(x = trainData..Concrete.Compressive.Strength.,
y = mod4_training)) +
geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") +
theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") +
xlab("True Training Values") + ylab("Training Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))
ggplotly(c)
Shown above is the plot of the actual concrete compressive
strength values in the training dataset (x-axis) versus the
concrete compressive strength values predicted by our improved
model (y-axis). As you can see, there is a strong correlation
present in the data and, unlike our first model, the variation
stays fairly constant.
testingDF_mod4 <- data.frame(testData$`Concrete Compressive Strength`, mod4_testing)
d = ggplot(data = testingDF_mod4, aes(x = testData..Concrete.Compressive.Strength., y = mod4_testing)) +
geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") +
theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") +
xlab("True Testing Values") + ylab("Testing Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))
ggplotly(d)
Shown above is the plot of the actual concrete compressive
strength values in the testing dataset (x-axis) versus the
concrete compressive strength values predicted by our improved
model (y-axis). We see a pattern here that is similar to the
previous plot. The variation appears to remain constant and
there is a strong and obvious positive correlation in the data.
summary(mod4)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5786 -4.4243 -0.2063 4.1486 25.3419
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                35.8901     0.2423 148.122  < 2e-16 ***
## Cement                     12.8595     0.6120  21.012  < 2e-16 ***
## `Blast Furnace Slag`        8.3504     0.6221  13.423  < 2e-16 ***
## `sqrt(Fly Ash)`             3.4522     0.6484   5.324 1.33e-07 ***
## Water                      -7.8561     2.2810  -3.444 0.000604 ***
## `sqrt(Superplasticizer)`   12.5998     1.4290   8.817  < 2e-16 ***
## `Coarse Aggregate`          1.3060     0.5087   2.568 0.010433 *
## `Fine Aggregate`           -4.1261     2.3197  -1.779 0.075682 .
## `log(Age)`                 10.3688     0.2526  41.040  < 2e-16 ***
## Superplasticizer            4.3590     2.8637   1.522 0.128384
## `Water:Superplasticizer`  -13.5814     3.3277  -4.081 4.95e-05 ***
## `Water:Fine Aggregate`      6.4522     2.5733   2.507 0.012369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.741 on 762 degrees of freedom
## Multiple R-squared: 0.8423, Adjusted R-squared: 0.84
## F-statistic: 370 on 11 and 762 DF, p-value: < 2.2e-16
Looking at the summary, we see the model still has an intercept of 35.8901 (again, the mean compressive strength in the training data, since the predictors were centered and scaled). However, judging by the coefficient magnitudes, the most influential terms are now cement, the square root of superplasticizer, and the interaction between water and superplasticizer, while coarse aggregate and the square root of fly ash are among the least influential. This model has a decent (and much improved) adjusted R-squared of 0.84, which implies only about 16% of the variation in the data cannot be explained by the model.
The residuals again appear roughly normally distributed and symmetric about 0, and their spread is noticeably smaller than the basic model's (residual standard error 6.741 versus 10.47), which suggests the improved model fits the data better.
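The improvement can also be quantified on the held-out test set. This sketch is our addition (not in the original report) and uses caret's `postResample()`, which reports RMSE, R-squared, and MAE for a vector of predictions against the observed values:

```r
# Test-set accuracy of the basic model
postResample(pred = mod3_testing,
             obs = testData$`Concrete Compressive Strength`)

# Test-set accuracy of the improved model; a lower RMSE and higher
# R-squared here would confirm the gain seen on the training data
postResample(pred = mod4_testing,
             obs = testData$`Concrete Compressive Strength`)
```

Comparing test-set rather than training-set metrics guards against mistaking overfitting for genuine improvement.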
##    Basic Model Prediction Improved Model Prediction Actual Concrete Compressive Strength
## 5                61.56700                 47.620266                            44.296075
## 10               31.93544                 35.517863                            39.289790
## 23               20.94199                  6.852754                             8.063422
## 89               50.93369                 40.077095                            35.301171
For comparison, we have included some values of the concrete compressive strength as predicted by both of our models and the actual value for some given rows in the raw dataset. As you can see, the predictions have improved thanks to the new model. However, this isn’t true for every data point.
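A table like the one above could be built as follows. This is our reconstruction (the report's code for this step is not shown); the row numbers 5, 10, 23, and 89 are taken from the output:

```r
# Predictions from both models for a few example rows of the raw dataset,
# alongside the actual recorded strengths
rows <- c(5, 10, 23, 89)
comparison <- data.frame(
  "Basic Model Prediction"    = predict(mod3, Concrete_Data[rows, ]),
  "Improved Model Prediction" = predict(mod4, Concrete_Data[rows, ]),
  "Actual Concrete Compressive Strength" =
    Concrete_Data$`Concrete Compressive Strength`[rows],
  check.names = FALSE, row.names = rows
)
comparison
```

`check.names = FALSE` keeps the human-readable column headings instead of converting them to syntactic R names.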
Using these transformations and interaction terms, we were able to estimate concrete compressive strength from the age of the concrete and the quantities of its ingredients (the input variables).
I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).